House Sales in King County - Variable importance analysis

Author: Piotr Grabysz

Date: 2021-05-07

In this notebook I train several models (described below) and investigate their variable importance. The models are trained to predict house prices based on the following features:

The models:

1) XGBoost

2) Random forest

Then I train the same models to predict $log(price)$ instead. The distribution of prices is very far from normal: almost all samples are concentrated around mid-range prices, but a few houses are several times more expensive. I wanted to check whether taking the logarithm of the prices helps.
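As an illustration of the log-target idea, here is a minimal sketch on synthetic data. It is not the notebook's own training code: the data, the forest settings and the use of scikit-learn's TransformedTargetRegressor wrapper are all assumptions made for the example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data (assumption): one feature and a
# right-skewed, strictly positive target that plays the role of price.
X = rng.uniform(0.0, 1.0, size=(500, 1))
price = np.exp(12.0 + 2.0 * X[:, 0] + rng.normal(0.0, 0.3, size=500))

# The wrapper fits the forest on log(price) and applies exp to its output,
# so predict() returns values on the original price scale.
log_model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log,
    inverse_func=np.exp,
)
log_model.fit(X, price)
predicted_price = log_model.predict(X[:5])
```

Keeping the back-transformation inside the model object means predictions from the plain and the log-target models can be compared on the same scale.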

3) XGBoost on $log(price)$

4) Random forest on $log(price)$

The models' performance

Taking the logarithm of prices slightly improves the XGBoost model, but the difference is so small that it may be due to random sampling. All four models have similar performance on the test set (the train/test split is done in the Appendix).
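To compare a model trained on prices with one trained on log prices, both errors have to be measured on the same scale. The sketch below assumes RMSE as the metric (the metric actually used in the notebook is not stated in this text) and uses made-up numbers; the function names are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_from_log_predictions(price_true, log_price_pred):
    """Back-transform log-scale predictions with exp, then score on the price scale."""
    return rmse(price_true, np.exp(log_price_pred))

# Made-up values for illustration only, not results from the notebook.
price_true = np.array([200_000.0, 450_000.0, 1_200_000.0])
pred_direct = np.array([210_000.0, 440_000.0, 1_100_000.0])
pred_log = np.log(np.array([205_000.0, 455_000.0, 1_150_000.0]))

error_direct = rmse(price_true, pred_direct)
error_log = rmse_from_log_predictions(price_true, pred_log)
```

Scoring the log model without the back-transformation would give an error in log units, which cannot be compared with an error in dollars.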

Variable importance of the models

We can see that four features (lat, long, sqft_living and grade) are the most important for all models; only their order differs. However, lat is the most important for all of them. Waterfront is the 5th most important for three of these models.
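One common model-agnostic way to obtain such a ranking is permutation importance: shuffle one feature at a time and measure how much the test error grows. The sketch below uses scikit-learn's permutation_importance on synthetic data. Both the data and the choice of this particular method are assumptions for illustration; the feature names are hypothetical and this is not necessarily how the plots in this notebook were produced.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (assumption): the target depends strongly on the first
# feature, weakly on the second and not at all on the third.
feature_names = ["strong", "weak", "noise"]  # hypothetical names
X = rng.normal(size=(600, 3))
y = 10.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# With this scoring, the importance of a feature is the increase in test
# RMSE after its column is shuffled, averaged over n_repeats shuffles.
result = permutation_importance(
    model,
    X_test,
    y_test,
    scoring="neg_root_mean_squared_error",
    n_repeats=5,
    random_state=0,
)
order = np.argsort(result.importances_mean)[::-1]
ranking = [feature_names[i] for i in order]
```

Because the importance is computed on held-out data and only needs predictions, the same procedure can be applied to all four models and the rankings compared directly.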

These results are consistent with my previous instance-level findings from SHAP and LIME, where I compared three houses with very different prices (a very low price, an average price and an extremely high price).

Recap of the SHAP analysis. The graphs show SHAP values for the XGBoost model, although the picture was very similar for the other models. The order of the features is similar to the global-level variable importance.

Unlike in the variable importance analysis, long is more important than lat here. But in general we can see that grade, sqft_living and geographical location are the most important both from the point of view of the whole model and from the point of view of the (deliberately selected) samples.

The fact that latitude is more important at the global level while longitude seems to be more important at the instance level needs some explanation. It so happened that my three sample instances lie at a similar latitude, so only a change in their longitude can push them towards the centre of the town. But if I had chosen a house lying at the bottom of the map, latitude would probably be the most important.

Let's return to the variable importance. Recall that lat, long, sqft_living and grade are the most important. The impact of lat and long is shown above. The influence of grade and sqft_living is also clearly visible in the data:

Appendix

Imports

Loading data

Train/test split (for models comparison)

Training the model - XGBoost

I train the XGBoost model with the parameters that I found to perform well in the previous homework.

Training the model - Random Forest

Training the model - XGBoost after log transforming prices

Training the model - random forest after log transforming prices

Comparison of models' performance

Variable importance